Run Data Profiling

Enabling DQLabs in erwin Data Intelligence (erwin DI) allows erwin Data Quality APIs to pull environment connection information from erwin DI and create catalogs in erwin Data Quality. You can add datasets and views to the catalogs and run data profiling.

Once the data quality analysis is complete, erwin Data Quality displays data quality analysis for catalogs (environments), datasets (tables), attributes (columns), and views. You can sync the analysis results to erwin DI and view them in the Metadata Manager.

Data quality analysis is available for environments using Oracle, Salesforce, Snowflake, MySQL, MSSQL, Hadoop, and PostgreSQL database types.

Before running data profiling, ensure the following pre-requisites are met:

  • Configure erwin Data Quality in erwin DI. To configure erwin Data Quality, refer to the Configuring Data Profiling topic.
  • Switch the Enable DQ Sync option on for the environment. To enable this option, refer to the Managing Environments topic.

This topic walks you through adding datasets and views, and running data profiling on the catalog for Sql Server database as an example. Similarly, you can run data profiling for other databases.

To run data profiling on a catalog, follow these steps:

  1. Go to Application Menu > Data Quality.
    Your erwin Data Quality instance opens. Login to erwin Data Quality (if asked).
  2. On the erwin Data Quality menu, click (Admin User).
  3. Under the admin user icon, click Settings.
  4. The Settings page appears.

  5. Click to expand menu.
  6. The menu expands and displays additional options.

  7. Expand Connect and click Sources.

  8. The Sources page appears and displays a list of connections.

  9. In the Actions column, click for the connection you want to modify. For example, TechPub_DQlab_SQL_Server.

  10. The Connection Details page appears.

  11. Enter or select appropriate values in the fields. Fields marked with a red asterisk are mandatory. Refer to the following table for field descriptions:
  12. Field Name

    Description

    Connection Name

    Specifies the name of the connection.

    For example, Sql Server.

    This field is auto-populated with the connection name which now includes the system name appended to the environment name. You can edit the name as required.

    Description

    Specifies the description of the connection.

    Server

    Specifies the server number. This field is auto-populated with the server number.

    Port

    Specifies the port number. This field is auto-populated with the port number.

    Database

    Specifies the name of the database. This field is auto-populated with the database name.

    Authentication Type

    Specifies the type of authentication.

    • Username and Password: Specifies the standard authentication method where users log in using a unique username and a secure password.

    • Windows Authentication: Specifies the authentication method that leverages Windows credentials to authenticate users based on their Windows login information automatically.

    Use vault

    Enables retrieval of authentication credentials from a vault. Select this checkbox to retrieve credentials from a vault instead of entering them manually.

    User

    Enter the username.

    For example, sa

    Password

    Enter the password.

    Schema

    Specifies the name of the schema. This option is available only by clicking Validate after entering your username and password.

    Supported Languages

    Specifies whether European characters are recognized during data profiling. Switch the European option on to include European characters.

    Select Vault

    Specifies the vault from which credentials are retrieved. This option is available only when you select the Use vault checkbox.

    Vault Key

    Specifies the key used to access credentials in the selected vault. This option is available only when you select the Use vault checkbox.

  13. Click Connect.
  14. The catalog configuration page appears and displays datasets and views in the catalog.

  15. Select the required datasets and views, then click Connect.
  16. You can use the checkbox at the top to select all datasets and views.

    A Confirmation pop-up appears.

  17. Click Yes.
  18. Data quality analysis might take some time, depending on the size of data in the catalog. You can click the View logs icon to view the status of the dataset and view.

  19. On the catalog configuration page, under the Actions column, use the following options:
  20. Schedule ( ): Use this option to schedule a data profiling job at predefined intervals.

    Delete (): Use this option to delete the dataset or view.

    After profiling data, the Discover menu opens and displays the Asset page. This page shows the data quality analysis of all assets in a card format including asset name, attributes, rows, DQ scores, alerts, and issues.

    You can drill down to view detailed information. For example, click the Alphabetical list of products card.

    The Observe menu opens and displays the Measures section, which provides the Data Quality (DQ) score along with detailed information about the dataset.

    Also, you can drill down to view DQ scores for attributes.

  21. Click the Attributes tab to view a list of all attributes and DQ scores.
  22. You can sync these results to erwin DI and view them in Metadata Manager. To sync data quality analysis results, you need to schedule a sync job. For more information, refer to the Scheduling Jobs topics.

You can further drill down to view the detailed analysis of an attribute. To view the detailed analysis of an attribute, follow these steps:

  1. Under the Attributes tab, click the Select Attribute filter and select the required attribute for which you want to view the detailed analysis. Alternatively, you can click attribute under List of all attributes.
  2. The Properties tab for the attribute appears. Also, you can view a comprehensive analysis of the data in tabular format and charts under the Profile section. For more information on charts, refer to the Data Quality Charts topic in the erwin Data Quality user guide.